Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance the formatting for Column #11724

Closed
wants to merge 6 commits into from

Conversation

goldmedal
Copy link
Contributor

Which issue does this PR close?

No corresponding issue.

Rationale for this change

Consider the following case:

let plan = LogicalPlanBuilder::table_scan(Some("t"), &schema, None).unwrap().build().unwrap();
let projection = LogicalPlan::Projection(Projection::try_new(vec![col("t.id"), ident("t.id")], Arc::new(plan)).unwrap());

Before this change, if we try to format Column, the result is:

Projection: t.id, t.id
  TableScan: t

We can't distinguish between col("t.id") and ident("t.id").

Now, the result will be:

Projection: t.id, "t.id"
  TableScan: t

This change is beneficial for debugging.

What changes are included in this PR?

  • Change the formatting behavior.

Are these changes tested?

yes

Are there any user-facing changes?

no

@github-actions github-actions bot added logical-expr Logical plan and expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Jul 30, 2024
@goldmedal goldmedal marked this pull request as draft July 30, 2024 16:35
Comment on lines +2755 to +2758
Expr::Alias(alias) => alias.expr.display_name(),
_ => expr.display_name(),
}
.ok()?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know whether this makes sense, but we need a name for the function dependencies, so I chose a name that wouldn't be affected by Display.

By the way, I'm confused about why we have so many different names or similar display methods for Expr. Maybe we should organize them or name them more clearly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#11782 filed

@github-actions github-actions bot added optimizer Optimizer rules sql SQL Planner substrait labels Jul 31, 2024
@goldmedal goldmedal marked this pull request as ready for review August 2, 2024 09:31
@@ -372,7 +372,7 @@ impl FromStr for Column {

impl fmt::Display for Column {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(f, "{}", self.flat_name())
write!(f, "{}", self.quoted_flat_name())
Copy link
Contributor

@jayzhan211 jayzhan211 Aug 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought identifier are the one to be double quoted in your ticket 😕, but it seems you are modifying column

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, an identifier is an unqualified column. Are they different? 🤔

Copy link
Contributor

@jayzhan211 jayzhan211 Aug 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

identifier could includes dot, that is the reason we need double quote to differentiate it from qualified column. Identifier with dot is relatively rare compare to column, so we don't have double quote every where.

In duckdb, they also has double quote for alias name (identifier).

D select t.a from t;
┌───────┐
│   a   │
│ int32 │
├───────┤
│     1 │
└───────┘
D select t.a as t.a from t;
Parser Error: syntax error at or near "."
LINE 1: select t.a as t.a from t;
                       ^
D select t.a as "t.a" from t;
┌───────┐
│  t.a  │
│ int32 │
├───────┤
│     1 │
└───────┘

If we want to introduce dot notation for struct a.b, it is consistent to duckdb if we don't have double quote for column.

https://duckdb.org/docs/sql/data_types/struct.html

Copy link
Contributor

@jayzhan211 jayzhan211 Aug 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think only adding double quote for identifier that includes dot is probably a better idea.

If we have column that has name includes dot, we display it like qualifier."name.with.dot"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

identifier could includes dot, that is the reason we need double quote to differentiate it from qualified column. Identifier with dot is relatively rare compare to column, so we don't have double quote every where.

Agreed. That makes sense to me. A Column comprises multiple identifiers: catalog.schema.table.column. All the parts are identifiers. However, because we don't have an Expr for an Identifier, I can only handle it for Column. Currently, we represent an identifier as an unqualified column.

pub fn ident(name: impl Into<String>) -> Expr {
Expr::Column(Column::from_name(name))
}

Ideally, all of the names should be identifiers (e.g., table name, column name, function name, or alias) that should be quoted if needed. This purpose will impact many parts.

I think only adding double quote for identifier that includes dot is probably a better idea.

If we have column that has name includes dot, we display it like qualifier."dot.column"

I believe quoted_flat_name() that will invoke quote_identifier handles this logic well.

pub fn quote_identifier(s: &str) -> Cow<str> {
if needs_quotes(s) {
Cow::Owned(format!("\"{}\"", s.replace('"', "\"\"")))
} else {
Cow::Borrowed(s)
}
}
/// returns true if this identifier needs quotes
fn needs_quotes(s: &str) -> bool {
let mut chars = s.chars();
// first char can not be a number unless escaped
if let Some(first_char) = chars.next() {
if !(first_char.is_ascii_lowercase() || first_char == '_') {
return true;
}
}
!chars.all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_')
}

This function doesn't always quote the identifier. Only quote it if it contains invalid characters, such as a dot, bracket, or uppercase letter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think only adding double quote for identifier that includes dot is probably a better idea.

If we have column that has name includes dot, we display it like qualifier."name.with.dot"

I'm rethinking what you said. I did a test for the function name using the original method (flat_name).

        let schema = Schema::new(vec![
            Field::new("id", DataType::Int32, false),
            Field::new("MAX(id)", DataType::Int32, false),
            Field::new("state", DataType::Utf8, false),
        ]);

        let plan = table_scan(Some("t"), &schema, None)
            .unwrap()
            .filter(col("state").eq(lit("CO")))
            .unwrap()
            .project(vec![col("MAX(id)"), max(col("id"))])
            .unwrap()
            .build()
            .unwrap();

-----
 right: "Projection: t.MAX(id), MAX(t.id)\n  Filter: t.state = Utf8(\"CO\")\n    TableScan: t"

We can distinguish between them easily.
Indeed, maybe we don't need to quote the identifier if it contains a bracket. Only quoting it for the dot might be simpler. 🤔

@github-actions github-actions bot removed sql SQL Planner optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) substrait labels Aug 3, 2024
Comment on lines +160 to +169
fn quoted_flat_name_if_contain_dot(&self) -> String {
match &self.relation {
Some(r) => format!(
"{}.{}",
table_reference_to_quoted_string(r),
quoted_if_contain_dot(&self.name)
),
None => quoted_if_contain_dot(&self.name).to_string(),
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jayzhan211 Before I fix other tests, I want to check if this behavior makes sense. (It involves too many tests 😢 ).
Now, we only quote an identifier if it contains the dot. However, some cases like sum(t1.c1) will also be quoted, even if it's a function call. I think it's not worth doing more checking to exclude this kind of case. What do you think?

Copy link
Contributor

@jayzhan211 jayzhan211 Aug 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is not ideal if sum(t1.c1) is quoted 🤔 . I hope the change is as small as possible, so I would prefer to keep function or others Expr remain the same, only identifier with dot is quoted.

We could also hold on and wait for more input from other's about the change of this, given the change of this is not trivial

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe instead of modifying Column, we should modify the display_name for Expr, so if we found column inside ScalarFunction, we could skip the double quote anyway. (by something like boolean flag?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe instead of modifying Column, we should modify the display_name for Expr, so if we found column inside ScalarFunction, we could skip the double quote anyway. (by something like boolean flag?)

I'm not sure if it's that simple 🤔. In my experience, the column might look like this:

Column { qualifier: None, name: "sum(t1.c1)" }

I think it's hard to find a consistent pattern for it because we use many Column::from_name calls to create projections. For example, in

.map(|x| Column::from_name(self.ident_normalizer.normalize(x)))

the column name could be complex and unruly.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Aug 4, 2024

I rethink about what the issue we are solving, so I would like to make sure we have such issue in practice. Is there any valid query that shows the ambiguity between the identifier and column?

statement ok
create table t1(a int, "t1.a" int) as values (100, 1), (101, 2), (102, 3), (101, 2);

query II
select t1.a, "t1.a" from t1;
----
100 1
101 2
102 3
101 2

query TT
explain select t1.a, "t1.a" from t1; 
----
logical_plan TableScan: t1 projection=[a, t1.a]
physical_plan MemoryExec: partitions=1, partition_sizes=[1]
plan: CreateMemoryTable: Bare { table: "t1" }
  Projection: CAST(column1 AS Int32) AS a, CAST(column2 AS Int32) AS t1.a
    Values: (Int64(100), Int64(1)), (Int64(101), Int64(2)), (Int64(102), Int64(3)), (Int64(101), Int64(2))
plan: Projection: t1.a, t1.t1.a
  TableScan: t1
plan: Explain
  Projection: t1.a, t1.t1.a
    TableScan: t1
query error DataFusion error: Schema error: Schema contains qualified field name t1\.a and unqualified field name a which would be ambiguous
select t1.a, "t1.a" as a from t1;

I guess it is not possible to project with the same name, thus the issue you mentioned probably not exist 🤔 ?

@goldmedal
Copy link
Contributor Author

I rethink about what the issue we are solving, so I would like to make sure we have such issue in practice. Is there any valid query that shows the ambiguity between the identifier and column?
I guess it is not possible to project with the same name, thus the issue you mentioned probably not exist 🤔 ?

Indeed, I think the final plan generated by the planner is fine. I tried many SQL cases, and none of them generated the issue I mentioned.

I encountered this issue when I used LogicalPlanBuilder to build the plan directly while implementing some analyzer rules (e.g., the expand wildcard rule) to transform the plans. When I print the intermediate plan to debug or check something, I get confused about the columns if they aren't qualified.

@goldmedal
Copy link
Contributor Author

I noticed that #11774 changes the debugging behavior of the plan. I guess it might improve the issue I encountered.

@alamb
Copy link
Contributor

alamb commented Aug 8, 2024

What is the status of this PR? CI seems not to be passing and #11724 (comment) suggests maybe you don't think it is needed anymore @goldmedal ?

Marking it as a draft to signify it isn't waiting on review anymore

@alamb alamb marked this pull request as draft August 8, 2024 16:07
@goldmedal
Copy link
Contributor Author

What is the status of this PR? CI seems not to be passing and #11724 (comment) suggests maybe you don't think it is needed anymore @goldmedal ?

Marking it as a draft to signify it isn't waiting on review anymore

Yes, this change impacts too many parts, and it won't occur in any valid query (#11724 (comment)). #11774 has resolved the issue that occurred. I think we can close this PR.

@alamb
Copy link
Contributor

alamb commented Aug 8, 2024

Thanks anyways for trying to make the code better @goldmedal

@alamb alamb closed this Aug 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants